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ABSTRACT 

In  this  paper,  we  describe  a  reversible  letter-to-sound/sound- 
to-letter  generation  system  based  on  an  approach  which  com¬ 
bines  a  rule-based  formalism  with  data-driven  techniques.  We 
adopt  a  probabilistic  parsing  strategy  to  provide  a  hierarchical 
lexical  analysis  of  a  word,  including  information  such  as  mor¬ 
phology,  stress,  syllabification,  phonemics  and  graphemics. 
Long-distance  constraints  are  propagated  by  enforcing  local 
constraints  throughout  the  hierarchy.  Our  training  and  test¬ 
ing  corpora  are  derived  from  the  high-frequency  portion  of 
the  Brown  Corpus  (10,000  words),  augmented  with  markers 
indicating  stress  and  word  morphology.  We  evaluated  our 
performance  based  on  an  unseen  test  set.  The  percentage 
of  nonparsable  words  for  letter-to-sound  and  sound-to-letter 
generation  were  6%  and  5%  respectively.  Of  the  remaining 
words  our  system  achieved  a  word  accuracy  of  71.8%  and 
a  phoneme  accuracy  of  92.5%  for  letter-to-sound  generation, 
and  a  word  accuracy  of  55.8%  and  letter  accuracy  of  89.4% 
for  sound-to-letter  generation.  We  also  compared  our  hierar¬ 
chical  approach  with  an  alternative,  single-layer  approach  to 
demonstrate  how  the  hierarchy  provides  a  parsimonious  de¬ 
scription  for  English  orthographic-phonological  regulairities, 
while  simultaneously  attaining  competitive  generation  eiccu- 
racy. 


INTRODUCTION 

This  paper  describes  a  trainable  probabilistic  system 
for  reversible  letter-to-sound/sound-to- letter  generation. 
Sound-to-letter  generation  is  a  crucial  aspect  in  the  prob¬ 
lem  of  automatic  detection/incorporation  of  new  words, 
which  is  in  turn  critical  for  the  development  of  large  vo¬ 
cabulary  speech  understanding  systems.  Moreover,  letter- 
to-sound  generation  will  continue  to  be  important  for 
speech  output,  especially  in  applications  such  as  read¬ 
ing  machines.  To  successfully  achieve  our  goal,  several 
important  issues  must  be  addressed.  First,  what  should 
be  the  inventory  of  linguistic  or  lexical  units  for  describ¬ 
ing  English  orthographic-phonological  regularities?  Sec¬ 
ond,  how  should  these  units  be  incorporated  into  the 
representation  of  English  orthography  and  phonology? 
Third,  what  algorithms  can  be  used  to  synthesize  and  em- 
alyze  the  spelling  and  prommciation  of  an  English  word 

^This  research  was  supported  by  ARPA  under  Contract  N00014- 
89-J-1332,  monitored  through  the  Office  of  Naval  Research,  and  a 
grant  from  Apple  Computer  Inc. 


in  terms  of  these  lexical  units?  These  three  issues  will 
be  addressed  in  detail  in  the  following  when  we  describe 
our  approach  and  report  on  our  system’s  performance 
for  both  letter-to-sound  [1]  and  sound-to-letter  genera¬ 
tion  [2].  The  novel  features  of  our  approach  include  the 
reversibility  of  the  combined  parsing  and  generative  pro¬ 
cesses,  the  ability  to  provide  multiple  output  hypotheses, 
the  capability  of  handling  uncertainty  in  the  input,  as 
well  as  our  treatment  of  non-parsable  words. 

PREVIOUS  WORK 

Letter-to-Sound  Generation 

One  of  the  first  approaches  adopted  for  letter-to-sound 
generation  is  typified  by  MITalk  [8] .  It  follows  the  theo¬ 
ries  of  generative  grammar  and  the  transformational  cy¬ 
cle  as  proposed  by  Chomsky  and  Halle  [3].  A  large  set  of 
ordered  cyclical  rules  are  applied  in  turn  to  the  word  in 
question  until  a  final  pronunciation  emerges.  While  the 
process  of  establishing  the  appropriate  rule  set  was  te¬ 
dious  and  time-consuming,  the  resulting  system  achieved 
a  degree  of  accuracy  that,  to  our  knowledge,  has  not  yet 
been  matched  by  other  more  automatic  techniques. 

Because  the  generation  of  cyclical  rules  is  a  difficult 
and  complicated  task,  several  research  groups  have  at¬ 
tempted  to  acquire  letter-to-sound  generation  systems 
through  automatic  or  semi-automatic  data-driven  tech¬ 
niques,  based  on  neural  nets  or  on  an  information  theo¬ 
retic  approach.  Typically,  the  goal  is  to  provide  as  little 
a  'priori  information  as  possible,  ideally,  only  a  set  of 
pairings  of  letter  sequences  with  corresponding  (aligned 
or  unaligned)  phone  sequences.  Iterative  training  algo¬ 
rithms  then  produce  a  probability  model  that  is  applied 
to  predict  the  most  likely  pronunciation.  Probably  the 
best  known  of  these  systems  is  NETtalk  [4] ,  which  learns 
a  pronunciation  of  the  current  letter  by  considering  the 
six  surrounding  letters  as  input  to  the  neural  network. 
Lucassen  and  Mercer  [5]  acquired  a  set  of  rules  automat¬ 
ically  from  a  large  lexicon  of  phonetically  labelled  data 
by  growing  decision  trees  using  a  criterion  based  on  mu¬ 
tual  information.  Although  direct  comparisons  of  per¬ 
formance  of  different  systems  is  difficult  due  to  the  lack 
of  standardized  phone  sets,  data  sets,  or  scoring  algo- 
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rithms,  these  systems  have  reported  phone  accuracies  in 
the  low  90’s  in  terms  of  the  percent  of  letters  correctly 
pronounced. 

Sound-to-Letter  Generation 

To  our  knowledge,  there  has  been  very  little  previous 
work  reported  in  the  literature  addressing  the  problem 
of  sound-to-letter  generation.  We  are  aware  of  only  two 
prior  research  efforts  in  this  area. 

Lucas  and  Damper  [6]  developed  a  system  for  bi¬ 
directional  text-phonetics  translation  using  two  neural 
networks  to  perform  statistical  string  translation.  This 
system  does  not  require  pre-aligned  text-phonetic  pairs 
for  training,  but  instead  tries  to  infer  appropriate  seg¬ 
mentations  and  alignments.  In  a  phonetics-to-text  trans¬ 
lation  task  using  two  disjoint  2, 000- word  corpora  for  train¬ 
ing  and  testing,  they  reported  a  71.3%  letter  and  a  22.8% 
word  accuracy. 

Another  related  effort  was  conducted  by  Alieva  and 
Lee  [7],  who  used  HMMs  to  model  the  acoiLstics  of  train¬ 
ing  sentences  based  on  the  orthographic  transcriptions. 
Context-dependent  quad-letter  acoustic  models  were  train¬ 
ed  with  15,000  sentences,  and  used  in  conjunction  with  a 
5-gram  letter  language  model.  Testing  on  a  disjoint  cor¬ 
pus  of  30  embedded  and  end-point  detected  words  (place 
and  ship  names)  gave  a  39.3%  letter  error  rate  and  21.1% 
word  accuracy.  However,  this  result  is  not  directly  com¬ 
parable  to  our  work  because  the  phonemic/phonetic  rep¬ 
resentation  is  bypassed. 

A  HIERARCHICAL  LEXICAL 
REPRESENTATION 

It  has  long  been  realized  from  research  in  speech  syn¬ 
thesis  that  a  variety  of  linguistic  knowledge  sources  play 
an  important  role  in  determining  English  letter/sound 
correspondences  [8].  For  example,  part-of-speech  causes 
the  noun  and  verb  forms  of  “record”  to  be  pronounced 
differently.  A  morphological  boundary  causes  the  letter 
sequence  “sch”  in  ’’discharge”  to  be  realized  differently 
from  that  in  “school”  or  “scheme”.  Stress  changes  the 
identity  of  vowels  in  a  word,  e.g.  “define”  vs.  “defini¬ 
tion”  .  Also,  syllabic  constraints  are  expressible  in  terms 
of  the  sequential  ordering  of  distinctive  features  -  sonor¬ 
ity  sequencing  in  manner  features,  and  phonotactic  con¬ 
straints  in  place  and  voicing  features.  Furthermore,  there 
are  graphemic  constraints  for  letter  to  letter  transitions. 
A  novel  feature  of  our  system  is  that  multiple  layers  of 
representation  are  incorporated  to  captiure  short  and  long 
distance  constraints.  These  include  word  class,  morphs, 
syllables,  manner  classes,  phonemes  and  graphemes. 
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Figure  1:  Parse  tree  for  “dedicated”  with  different  linguistic 
layers  indicated  numerically. 


morphological  and  phonological  units.  These  units  are 
organized  as  a  hierarchical  tree  structure,  where  the  var¬ 
ious  levels  of  linguistic  knowledge  are  collectively  used  to 
describe  orthographic-phonological  regularities.  Figure  1 
illustrates  the  description  of  the  word  “dedicated” . 

The  higher  levels  encode  longer  distance  constraints, 
while  the  lower  levels  carry  more  local  constraints.  By 
allowing  the  terminal  nodes  to  be  dual  in  nature  (i.e., 
representing  either  phones  or  letters),  we  can  create  di¬ 
rect  symmetry  between  the  letter-to-sound  and  sound- 
to-letter  generation  tasks  simply  by  swapping  the  in¬ 
put/output  specification. 

One  should  note  in  Figure  1  that  [’*’]  is  a  graphemic 
“place-holder”  introduced  to  maintain  consistency  be¬ 
tween  the  representations  of  the  words  “dedicate”  and 
“dedicated”,  where  an  inflexional  suffix  [isuf]  has  been 
attached  to  the  latter  word.  Another  noteworthy  detail 
is  the  special  [m-ONSET]  category,  which  signifies  that 
the  letter  ‘c’  should  belong  to  the  root  “-die-”  but  has, 
become  a  moved  onset  of  the  next  syllable  due  to  syllab¬ 
ification  principles  such  as  the  Maximal  Onset  Principle 
eind  the  Stress  Resyllabification  Principle.^ 


THE  PARSING  ALGORITHM 

We  are  adopting  a  technique  that  represents  a  cross 
between  explicit  rule-driven  strategies  and  strictly  data- 
driven  approaches.  About  100  generalized  context-free 
rules,  such  as  those  illustrated  in  Table  1  are  written 
by  hand,  and  training  words  are  parsed  using  TINA  [9], 
according  to  their  marked  linguistic  specifications.  The 
parse  trees  of  format  as  show  in  Figure  1  are  then  used 


^According  to  Webster’s  New  World  Dictionary,  the  root  of 
“dedicated”  is  “-die-”,  which  is  derived  from  the  Latin  word 
“dicare”. 


We  created  a  framework  which  describes  the  spelling 
and  pronunciation  of  English  words  using  only  a  small 
inventory  of  labels  associated  with  the  aforementioned 


®The  Maximal  Onset  Principle  states  that  the  number  of  conso¬ 
nants  in  the  onset  position  should  be  maximized  when  phonotactic 
and  morphological  constraints  permit,  and  Stress  Resyllabification 
refers  to  maximizing  the  number  of  consonants  in  stressed  syllables. 


290 


word  — >  [prefix]  root  [sufHx] 

root  — >  stressed-syllable  [reduced-syllable] 

stressed-syllable  [onset]  nucleus  [coda] 

nucleus  —>■  vowel 

nasal  — »  (/m/  /n/  /i]/) 

/m/  ->  (“m”  “me”  “mn”  “mb”  “mm”  “mp”) 

Table  1:  Example  rules  at  each  of  the  different  layers. 


to  train  the  probabilities  in  a  set  of  “layered  bigrams” 
[10].  We  have  chosen  a  probabilistic  parsing  paradigm 
for  four  reasons:  First,  the  probabilities  serve  to  augment 
the  known  structural  regularities  that  can  be  encoded  in 
simple  rules  with  other  structural  regularities  which  may 
be  automatically  discovered  from  a  large  body  of  training 
data.  Secondly,  since  the  more  probable  parse  theories 
are  distinguished  from  the  less  probable  ones,  search  ef¬ 
forts  can  selectively  concentrate  on  the  high  probability 
theories,  which  is  an  effective  mechanism  for  perplex¬ 
ity  reduction.  Thirdly,  probabilities  are  less  rigid  than 
rules,  and  adopting  a  probabilistic  framework  allows  us 
to  easily  generate  multiple  parse  theories.  Fourthly,  the 
flexibility  of  a  probabilistic  framework  also  enables  us  to 
automatically  relax  constraints  to  attain  better  coverage 
of  the  data. 

Training  Procedure 

The  layered  bigrams  formalism  attaches  probabili¬ 
ties  to  sibling-sibling  transitions  in  context-free  grammar 
rules.  It  has  been  shown  to  achieve  a  low  perplexity 
at  the  linguistic  level  within  the  axis  domain  [10].  For 
our  current  sub-word  application,  we  have  modified  the 
layered-bigrams  in  two  ways:  (1)  parse  trees  are  gener¬ 
ated  in  a  bottom-up  fashion  instead  of  top-down,  and 
(2)  the  contextual  information  used  in  bottom-up  pre¬ 
diction  includes  the  complete  history  in  the  immediate 
left  column. 

Our  experimental  corpus  consists  of  the  10,000  most 
frequent  words  appearing  in  the  Brown  Corpus  [11],  where 
each  word  entry  contains  a  spelling  and  a  single  unaligned 
phoneme  string.  We  used  about  8,000  words  for  training, 
and  a  disjoint  set  of  about  800  words  for  testing. 

The  set  of  training  probabilities  are  estimated  by  tab¬ 
ulating  counts  using  the  training  parse  trees.'^  It  includes 
bottom-up  prediction  probabilities  for  each  category  in 
the  parse  tree,  and  column  advancement  probabihties  for 
extending  a  column  to  the  next  terminal.  The  saune  set  of 
probabilities  are  used  for  both  letter-to-sound  and  sound- 
to-letter  generation. 

Testing  Procedure 

In  letter-to-sound  generation,  the  system  takes  in  a 
spelling  as  an  input,  generates  a  parse  tree  in  a  bottom- 


^See  [1]  for  a  more  detailed  description  of  this  process. 


up  left-to- right  fashion,  and  derives  a  phonemic  pronunci¬ 
ation  from  the  complete  parse.  In  sound-to-letter  gener¬ 
ation,  the  system  accepts  a  string  of  phonemes  as  input, 
and  generates  letters.  An  inadmissible  stack  decoding 
search  algorithm  is  adopted  for  its  simplicity.  If  multiple 
hypotheses  are  desired,  the  algorithm  can  terminate  af¬ 
ter  multiple  complete  hypotheses  have  been  popped  off 
the  stack.  These  hypotheses  are  subsequently  re-ranked 
according  to  their  actual  parse  score.  Though  our  search 
is  inadmissible,  we  are  able  to  obtain  multiple  hypotheses 
inexpensively  with  satisfactory  performance. 

EXPERIMENTAL  RESULTS 

Experiments  on  both  letter-to-sound  and  sound-to- 
letter  generation  were  conducted  using  26  letters,  one 
graphemic  place-holder  and  52  phonemes  (including  sev¬ 
eral  unstressed  vowels  and  pseudo  diphthongs  such  as  /o 
r/).  Each  entry  in  the  test  corpus  contains  a  spelling 
corresponding  to  a  single  pronunciation.  The  genera¬ 
tion  procedures  use  evaluation  criteria  that  directly  mir¬ 
ror  one  another.  Word  accuracy  is  the  percentage  of 
parsable  words  for  which  the  top-ranking  theory  gener¬ 
ates  a  spelling/pronunciation  that  matches  the  lexical 
entry  exactly.  Non-parsable  words  are  those  for  which 
no  spelling/pronunciation  output  is  produced.  “Top  N” 
word  accuracy  refers  to  the  percentage  of  parsable  words 
for  which  the  correctly  generated  spelling/pronunciation 
appears  in  the  top  N  complete  theories.  Letter/Phoneme 
accuracies  include  insertion,  substitution  and  deletion  er¬ 
ror  rates,  and  are  obtained  using  the  program  provided 
by  NIST  for  evaluating  speech  recognition  systems. 

Results  on  Letter-to-Sound  Generation 

In  letter-to-sound  generation,  about  6%  of  the  test  set 
was  nonparsable.  This  set  consists  of  compound  words, 
proper  names,  and  words  that  failed  due  to  sparse  data 
problems.  Results  for  the  parsable  portion  of  the  test  set 
are  shown  in  Table  2.  The  69.3%  word  accuracy  corre¬ 
sponds  to  a  phoneme  accuracy  of  91.7%,  where  an  inser¬ 
tion  rate  of  1.2%  has  been  taken  into  account. 

Thus  far  there  are  no  standardized  evaluation  meth¬ 
ods  for  text-to-speech  systems,  and  therefore  comparison 
among  different  systems  remains  difficult.  Errors  in  the 
generated  stress  pattern  and/or  phoneme  insertion  er¬ 
rors  are  often  neglected.  Evaluation  criteria  that  have 
been  used  include  word  accuracy,  accuracy  per  phoneme 
and  accuracy  per  letter  (in  measuring  the  accuracy  per 
letter,  silent  letters  are  regarded  as  mapping  to  a  [null] 
phone).  We  believe  that  accuracy  per  letter  would  gener¬ 
ally  be  higher  than  accuracy  per  phoneme,  because  there 
are  generally  more  letters  than  phonemes  per  word,  and 
the  letters  mapping  to  the  generic  category  [null]  would 
usually  be  correct.  To  verify  our  claim,  we  computed  the 
two  measurements  based  on  our  training  set,  using  the 
alignment  provided  by  the  training  parse  trees.  Our  re- 
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Table  2:  Letter-to-Sound-Generation  Experiments:  Word 
and  Phoneme  Accuracy  for  IVaining  and  Testing  data 


Figure  2:  Letter-to-Sound:  Percent  correct  whole-word  the¬ 
ories  as  a  function  of  iV-best  depth  for  the  test  set 


suit  shows  that  a  per  letter  measurement  would  lead  to 
a  10%  reduction  in  error  rate. 

Figure  2  is  a  plot  of  cumulative  percent  correct  of 
whole  word  theories  as  a  function  of  the  iV-best  depth 
for  the  test  set.  Although  30  complete  theories  were  gen¬ 
erated  for  each  word,  no  correct  theories  occur  beyond 
N  =18  after  resorting,  with  an  asymptotic  value  of  just 
over  89%. 

Results  on  Sound-to-Letter  Generation 

In  sound-to-letter  generation,  about  4%  of  the  test 
set  was  nonparsable.  Results  for  the  parsable  words  are 
shown  in  Table  3;  top-choice  word  accuracy  for  sound-to- 
letter  is  about  52%.  This  corresponds  to  a  letter  accu¬ 
racy  of  88.6%,  with  an  insertion  error  rate  of  2.5%  taken 
into  account.  This  performance  compares  favorably  with 
those  reported  in  previous  work. 

Figure  3  is  a  plot  of  the  cumulative  percent  correct 
(in  sound-to-letter  generation)  of  whole  word  theories  as 
a  function  of  iV-best  depth  of  the  test  set.  The  asymp¬ 
tote  of  the  graph  shows  that  the  first  30  complete  the¬ 
ories  generated  by  the  parser  contain  a  correct  theory 
for  about  83%  of  the  test  words.  Within  this  pool,  re¬ 
sorting  using  the  actual  parse  score  has  put  the  correct 
theory  within  the  top  10  choices  for  about  81%  of  the 
cases,  while  the  remaining  2%  have  their  correct  theo¬ 
ries  reuiked  between  AT  =  10  and  N  =  30.  Resorting 
seems  to  be  less  effective  in  the  sound-to-letter  case,  pre¬ 


Table  3:  Sound-to-Letter  Generation  Experiments:  Word 
and  Letter  Accuracy  for  Training  and  Testing  data 


Figure  3:  Sound-to-Letter:  Percent  correct  whole-word  the¬ 
ories  as  a  function  of  A-best  depth  for  the  test  set 


sumably  because  many  more  “promising”  theories  can 
be  generated  than  for  letter-to-soimd.  A  possible  rea¬ 
son  for  this  is  the  ambiguity  in  phoneme-to-letter  map¬ 
ping,  and  another  reason  is  that  geminant  letters  are  of¬ 
ten  mapped  to  the  same  (consonantal)  phoneme.  For 
example,  the  generated  spellings  from  the  pronunciation 
of  “connector”  i.e.,  the  phoneme  string  (k  i  n  e  k  t  af), 
include:  “conecter”,  “conector”,  “connecter”,  “connec¬ 
tor”,  “conectar”,  “conectyr”,  “conectur”,  “connectyr”, 
“connectur”,  “conectter”,  “connectter”  and  “cannecter”. 
Many  of  these  hypotheses  can  be  rejected  with  the  avail-' 
ability  of  a  large  lexicon  of  legitimate  English  spellings. 

Error  Analyses 

Both  of  the  cumulative  plots  shown  above  reach  an 
asymptotic  value  well  below  100%.  The  words  that  be¬ 
long  to  the  portion  of  the  test  set  lying  above  the  asymp¬ 
tote  appear  intractable  -  a  correct  pronunciation/spelling 
did  not  emerge  as  one  of  the  30  complete  theories.  De¬ 
tailed  analysis  of  these  words  shows  that  they  fall  into  ap¬ 
proximately  4  categories.  (1)  Generated  pronunciations 
that  have  subtle  deviations  from  the  reference  strings.  (2) 
Unusual  prommciations  due  to  influences  from  foreign 
leinguages.  (3)  Generated  pronunciations  which  agree 
with  the  regularity  of  English  letter-phoneme  mappings, 
but  were  nevertheless  incorrect.  (4)  Errors  attributable 
to  sparse  data  problems.  Some  examples  are  shown  in 
Table  4.  It  is  interesting  to  note  that  there  is  mudi  over¬ 
lap  between  the  set  of  problematic  words  in  letter-to- 
sound  and  sound-to-letter  generation.  This  implies  that 
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Category 

correct 

spelling 

generated 

spelling 

generated 

pronunciation 

correct 

pronunciation 

(1) 

Subtle 

acquiring 

equiring 

ikwa^riq 

ikwa^yiq 

balance 

balence 

correct 

baelins 

launch 

lawnch 

correct 

lone 

pronounced 

pronounst 

pnna'*'nst 

pro'*'na'^nst 

(2) 

Unusual 

champagne 

shampain 

cffimpignF 

saempe^'n 

debris 

dibree 

di^'bns 

dibri>' 

(3) 

Regular 

basis 

correct 

bssis 

be^'sis 

elite 

aleat 

ilo^t 

ilFt 

violence 

viallence 

correct 

va^ilins 

viscosity 

viscossity 

visko*siti>' 

viskasiti’' 

(4) 

Sparse 

braque 

brack 

braekwF 

braek 

Table  4:  Some  examples  of  generation  errors 


improvements  made  in  one  generative  direction  should 
carry  over  to  the  opposite  direction  as  well. 

EVALUATING  THE 
HIERARCHY 

We  believe  that  the  higher  level  linguistic  knowlege 
incorporated  in  the  hierarchy  is  important  for  our  gener¬ 
ation  tasks.  Consequently,  we  would  like  to  empirically 
assess:  (1)  the  relative  contribution  of  the  different  lin¬ 
guistic  layers  towards  generation  accuracy,  and  (2)  the 
relative  merits  of  the  overall  design  of  the  hierarchical  lex¬ 
ical  representation.  Our  studies  [13]  are  based  on  letter- 
to-sound  generation  only,  although  we  expect  that  the 
implications  of  our  study  should  carry  over  to  soimd-to- 
letter  generation. 

Investigations  on  the  Hiereirchy 

The  implementation  of  our  parser  is  flexible,  in  that  it 
can  train  and  test  on  a  variable  number  of  layers  in  the 
hierarchy.  This  enables  us  to  explore  the  relative  con¬ 
tribution  of  each  linguistic  level  in  the  generation  task. 
We  conducted  a  series  of  experiments  whereby  an  in¬ 
creasing  amount  of  linguistic  knowledge  (in  terms  of  the 
number  of  layers  in  the  hierarchy)  is  omitted  from  the 
training  parse  trees.  For  each  reduced  configuration,  the 
system  is  re-trained  and  re-tested  on  the  same  training 
and  testing  corpora  as  described  earlier.  For  each  ex¬ 
periment  we  compute  the  top-choice  word  accuracy  emd 
perplexity,  which  reflect  the  amount  of  constraint  pro¬ 
vided  by  the  hierarchical  representation.  We  also  mea¬ 
sure  the  coverage  to  show  the  extent  to  which  the  parser 
can  generalize  to  account  for  previously  unseen  struc¬ 
tures,  and  count  the  number  of  system  parameters  in 
order  to  observe  the  computational  load,  as  well  eis  the 
parsimony  of  the  hierarchical  framework  in  capturing  En¬ 
glish  orthographic-phonological  regularities.  We  found 
that  for  every  layer  omitted  from  the  representation,  lin¬ 
guistic  constraints  are  lost,  manifested  as  a  lower  gener¬ 


ation  accuracy,  higher  perplexity  and  greater  coverage. 
Fewer  layers  also  require  fewer  training  parameters. 

The  significant  exception  was  the  case  of  omitting  the 
layer  of  broad  classes  (layer  5),  which  seems  to  introduce 
additional  constraints,  thus  giving  the  highest  generation 
performance.  The  word  accuracy  based  on  the  parsable 
portion  of  the  test  set  was  71.8%,®  which  corresponds  to 
a  phoneme  accuracy  of  92.5%.  This  improvement®  can 
be  understood  by  realizing  that  broad  classes  can  be  pre¬ 
dicted  from  phonemes  with  certainty,  and  the  inclusion  of 
the  broad  class  layer  probably  led  to  excessive  smoothing 
across  the  individual  phonemes  within  each  broad  class. 
Again,  about  6%  of  the  test  set  was  nonparsable.  When  a 
robust  parsing  scheme  is  used  to  recover  the  nonparsable 
words,  100%  coverage  was  achieved,  but  performance  de¬ 
grades  to  69.2%  word  and  91.3%  phoneme  accuracy. 

Comparison  with  a  Single-Layer  Approach 

We  also  compared  our  current  hierarchical  framework 
with  an  alternative  approach  which  uses  a  single-layer 
representation.  Here,  a  word  is  represented  mainly  by 
its  spelling  and  an  aligned  phonemic  transcription,  us¬ 
ing  the  [null]  phoneme  for  silent  letters.  The  alignment 
is  based  on  the  training  parse  trees  from  the  hierarchical 
approach.  For  example,  “bright”  is  transcribed  as  /b  r 
NULL  NULL  t/.  The  word  is  then  fragmented  exhaustively 
to  obtain  letter  sequences  (word  fragments)  shorter  than 
a  set  maximum  length.  During  training,  bigram  proba¬ 
bilities  and  phonemic  transcription  probabilities  are  com¬ 
puted  for  each  letter  sequence.  Therefore  this  approach 


®When  normalized  on  the  entire  test  set,  the  word  accuracy 
becomes  67.5%. 

®We  also  found  improvement  on  sound-to-letter  generation  - 
55.8%  word  accuracy  on  the  parsable  test  words,  corresponding  to 
89.4%  letter  accuracy,  and  5%  of  the  words  were  nonparsable. 

^However,  broad  classes  may  still  serve  a  role  as  a  “fast  match” 
layer  in  recognition  experiments,  where  their  predictions  could  no 
longer  be  certain,  due  to  recognition  errors. 
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captiires  some  graphemic  constradnts  within  the  word 
fragment,  but  higher  level  linguistic  knowledge  is  not  ex¬ 
plicitly  incorporated.  Letter-to-sound  generation  is  ac¬ 
complished  by  finding  the  “best”  concatenation  of  letter 
sequences  which  constitutes  the  spelling  of  the  test  word. 

To  facilitate  comparison  with  the  hierarchical  approach, 
we  use  the  same  training  and  test  sets  to  run  letter-to- 
soimd  generation  experiments  with  the  single-layer  ap)- 
proach.  Several  different  value  settings  were  used  for 
the  maximum  word  fragment  length.  We  expect  genera¬ 
tion  eiccuracy  to  improve  as  the  maximum  word  fragment 
length  increases,  because  longer  letter  sequences  can  caj)- 
ture  more  context.  However,  this  should  be  accompanied 
by  an  increase  in  the  number  of  system  parameters  due  to 
the  combinatorics  of  the  letter  sequences.  Furthermore, 
there  are  no  nonparsable  test  words  in  the  single-layer 
approach,  because  it  can  always  “backoff’  to  mapping  a 
single  letter  to  its  most  probable  phoneme. 

The  hierarchical  approach  (without  the  broad  class 
layer)  achieved  the  same  performance  as  the  highest  per¬ 
forming  single-layer  approach,  which  allowed  a  maximum 
fragment  length  of  6.®  The  mean  firagment  length  of  the 
segmentations  used  in  the  test  set  by  the  single-layer  ai>- 
proach  was  3.7,  while  the  mean  grapheme  length  used 
by  the  hierarchical  approach  was  only  1.2.  The  hierar¬ 
chical  approach  is  capable  of  reversible  generation  us¬ 
ing  about  32,000  parameters,  while  the  single-layer  ap¬ 
proach  requires  693,300  parameters  (a  20-fold  increase) 
for  uni-directional  letter-to-sound  generation.  In  order 
to  achieve  reversibility,  the  number  of  parameters  would 
have  to  be  doubled. 

DISCUSSION 

Our  current  work  demonstrates  the  utility  of  a  hi¬ 
erarchical  framework,  which  is  relatively  rich  in  linguis¬ 
tic  knowledge,  for  bi-directional  letter-to-soimd/sound- 
to-letter  generation.  The  use  of  layered  bigrams  in  our 
hierarchy  is  extendable  to  encompass  natural  language 
constraints  [10],  prosody,  discourse  and  perhaps  even  di¬ 
alogue  modeling  constraints  on  top,  as  well  as  phonet¬ 
ics  and  acoustics  at  the  bottom.  As  such  this  paradigm 
should  be  particularly  useful  for  applications  in  speech 
synthesis,  recognition  and  xmderstanding. 

The  versatility  of  our  framework  can  lead  to  a  variety 
of  applications.  These  range  from  a  lexical  representation 
for  large-vocabulary  recognition,  which  provides  seman¬ 
tic  information,  syntactic  information,  and  a  clustering 
mechanism  for  fast  match  [12],  to  a  low-perplexity  lan¬ 
guage  model  for  character  recogition  tasks,  where  our 
system  gives  a  test  set  perplexity  of  8.0  as  constrasted 


with  11.3  firom  a  standard  letter  bigram.  In  the  near  fu¬ 
ture,  we  plan  to  report  on  our  robust  parsing  mechanism 
for  extending  coverage,  and  to  experiment  with  alterna¬ 
tive  search  strategies  in  the  layered  bigrams  framework. 
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